Abstract.

Sorting and hashing are two completely different concepts in computer science, and appear 
mutually exclusive to one another. Hashing is a search method using the data as a key to map to the 
location within memory, and is used for rapid storage and retrieval. Sorting is a process of organizing 
data from a random permutation into an ordered arrangement, and is a common activity performed 
frequently in a variety of applications. 

Almost all conventional sorting algorithms work by comparison, and in doing so have a
linearithmic greatest lower bound on the algorithmic time complexity. Because sorting is
used so widely, any improvement in the theoretical time complexity of a sorting algorithm
can yield much larger gains in the speed of the applications that use it. To exceed the
linearithmic boundary on algorithmic performance, such a sort algorithm must order the
data by some method other than comparison.

The hash sort is a general purpose non-comparison based sorting algorithm by hashing, which has 
some interesting features not found in conventional sorting algorithms. The hash sort asymptotically 
outperforms the fastest traditional sorting algorithm, the quick sort. The hash sort algorithm has a 
linear time complexity factor - even in the worst case. The hash sort opens an area for further work 
and investigation into alternative means of sorting. 

1. Theory.
1.1. Sorting. 

Sorting is a common processing activity for computers to perform. Sorted data
is arranged so that the data items have increasing value (ascending) or,
alternatively, decreasing value (descending). Regardless of the arrangement form,
sorting establishes an ordering of the data. Arranging data from some random
configuration into an ordered one is often necessary in algorithms, applications,
and programs. This need makes the space and temporal complexity of the sorting
algorithm used paramount.

A bad choice of sorting algorithm by a designer or programmer can result in
mediocre performance in the end. Given the widespread need for sorting and the
importance of performance, many different kinds of sorting algorithms have been
devised. Some, such as quick sort and bubble sort, are very widely used. Others,
such as bin sort and pigeonhole sort, are not as widely known.

The plethora of sorting algorithms available does not change the paramount
question of how fast it is possible to sort. This is a question of temporal efficiency,
and it is the most significant criterion for a sort algorithm. Just as important is what
affects the temporal efficiency, and why. Another important concern, though more
subtle, is the space requirement of the algorithm, its space efficiency. This is a
matter of the algorithm overhead and resource requirements involved in the sorting
process.

For most sort algorithms, the temporal efficiency and the causes of changes in it
are well understood. The limit or barrier to sorting speed for comparison-based
algorithms is a greatest lower bound of O(N log N). In essence, the reasoning behind
this is that sorting is a comparative or decision making process over N items, which
forms a binary tree of depth log N. For each data item, a decision must be made where
to move it to maintain the ordering property desired. With N data items, and log N
decision time, the minimum time possible is the product, which is N log N. This lower
bound is seldom reached in practice; actual running time is usually a multiplied or
added constant away from the theoretical minimum.
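
The informal argument above can be tightened with the standard decision-tree
bound for comparison sorts. The derivation below is the usual textbook form, not
a formula taken from this paper; h denotes the depth of the decision tree, a
symbol introduced here for illustration. A binary tree of depth h has at most
2^h leaves, and it needs at least N! leaves to distinguish every permutation:

$$2^{h} \ge N! \;\Longrightarrow\; h \ge \log_2 N! = \sum_{i=1}^{N} \log_2 i \;\ge\; \frac{N}{2} \log_2 \frac{N}{2} = \Omega(N \log N)$$

The last inequality keeps only the largest N/2 terms of the sum, each of which
is at least log_2(N/2).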

There is no theoretical greatest lower bound for space efficiency, and often this
complexity measure characterizes a sort algorithm. The space requirements are highly
dependent on the underlying method used in the sort algorithm, so space efficiency
directly reflects this. While no theoretical limit is available, an optimum sort
algorithm will require N + 1 storage: N locations for the data items, and one
additional location used by the sort algorithm. A bubble sort algorithm has this
optimal storage efficiency. However, optimal space efficiency is subordinate to
temporal efficiency in a sorting algorithm. The bubble sort, while having an
optimal space efficiency, is well reputed to be very poor in performance, far above the
theoretical lower bound.

1.2. Hashing. 

Hashing is a process of searching through data, using each data item itself as a 
key in the search. Hashing does not order the data items with respect to each other. 
Hashing organizes the data according to a mapping function which is used to hash 
each data item. Hashing is an efficient method to organize and search data quickly. 
The temporal efficiency or speed of the hashing process is determined by the hash 
function and its complexity. 

Hash functions are mathematically based, and a common hash function uses the 
remainder mod operation to map data items. Other hash functions are based on 
mathematical formulas and equations. The construction of a hash function is usu- 
ally from multiplicative, divisional, additive operations, or some mix of them. The 
choice of hash function follows from the data items and involves temporal and space 
organization compromises. 

Hashing is not a direct mathematical mapping of the data items into a space
organization. Hash functions usually exhibit the phenomenon of a hash collision or
clash, where two data items map to the same spatial location. This is detrimental
to the hash function, and many techniques for handling hash collisions exist. Hash
collisions introduce temporal inefficiency into a hash algorithm, as the handling of a
collision represents additional time to re-map the data item.

A special type of hash function, a perfect hash function, exists and is perfect in
the sense it will not have collisions. These types of perfect hash functions are often
available only for narrow applications, such as the reserved words in a compiler symbol
table. Perfect hash functions usually involve restricted data items and do not fit
into a general purpose hash method. While perfect hashing is possible, often a regular
hash function is used with data in which the number of collisions is few or the collision
resolution method is simple.

2. Algorithm. 

2.1. Super-Hash Function. 

The heart of the hash sort algorithm is the concept of combining hashing along 
with sorting. This key concept is embodied in the notion of a super-hash function. 
A super-hash function is "super" in that it is a hash function that has a consistent 
ordering, and is a perfect hash function. A hash function of this type will order 
the data by hashing, but maintain a consistent ordering and not scatter the data. 
This hash function is perfect so that as the data is ordered it is uniquely placed when 
hashed, so that no data items ambiguously share a location in the ordered data. Along 
with that, the necessity for hash collision resolution can be avoided altogether. 

The super-hash function is a generalized function, in that it is extendible math- 
ematically, and is not a specialized type of hash function. The super-hash function 
operates on a data set which is of positive integer values. With a super-hash function, 
the restrictions on the data set as the domain is that it is within a bounded range, 
between a minimum and maximum value. The only other restriction is that each in- 
teger value be unique - no duplicate data values. This highly closed set of data values 
as positive integers is necessary to build a preliminary super-hash function. Once a 
mathematically proven super-hash function is formulated, then other less restricted 
data sets can be explored. 

A super-hash function as described and within these parameters is not a complex
or difficult function to define. A super-hash function uses the standard hash function
based on the modulus or residue operator. This operator, the integer remainder called
mod, gives a common hash function of the form (x mod n). The other part of the
super-hash function is another hash function called a mash function, for modified
hash or magnitude hash. The mash function uses the integer division operator. This
operator, called div, is used in a form similar to the hash, (x div n). Both of
these functions, the mash function and the hash function, together form the
super-hash function. This super-hash function is mathematically based and extensible,
and is also a perfect hash function. For the super-hash function to be perfect, it must
be an injective mapping.

The super-hash function works using a combination of a regular hash function
and the mash function. Together both of these functions make a super-hash function,
but not as the composition of the two. Both the hash function and the mash function are
sub-functions of a super-hash function. The regular hash function (x mod n) works
using the remainders or residues of values. So numbers are of the form $c \cdot x + r$, where
r is the remainder obtained using the regular hash function. When hashing by a value
n, the resulting hashes map into the range 0 to n - 1. In essence, a set of values is
formed so that each value in the set is one of
$\{c \cdot x + 0,\; c \cdot x + 1,\; \ldots,\; c \cdot x + (n - 2),\; c \cdot x + (n - 1)\}$.

A hash function provides some distinction among the values, using the remainder
or residue of the value. However, regular hashing experiences collisions, where
distinct values hash to the same result. The problem is that values with the same
remainder are indistinguishable to the regular hash function. Given a particular
remainder r, all values that differ only in the multiple c are equivalent. So a set of
the form $\{c_1 \cdot x + r_1,\; c_2 \cdot x + r_1,\; \ldots,\; c_{n-1} \cdot x + r_1,\; c_n \cdot x + r_1\}$
has all values hashed to the same result. For n = 10, r = 1 the following set of values
are equivalent under the regular hash function:
$\{1, 11, 21, 31, \ldots, 1001, 10001, \ldots, c \cdot x + 1\}$.
It is the equivalence of the values under regular hashing which causes collisions, as
values map to the same hash result. There is no distinction among larger and smaller
values within the same hash result. Some partitioning or separation of the values is
obtained, but what is needed is another hash function to distinguish the hash values
further by their magnitude relative to one another in the same set.
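
The equivalence of values under the regular hash can be checked directly. The
following is a minimal sketch in C, assuming non-negative values; the function
names hash_mod and collides are hypothetical and chosen only for this
illustration.

```c
#include <assert.h>

/* Regular hash: the residue of x modulo n. Assumes n > 0. */
unsigned hash_mod(unsigned x, unsigned n)
{
    assert(n > 0);
    return x % n;
}

/* Returns 1 when x and y collide under the regular hash, that is,
   when they share the same residue modulo n; returns 0 otherwise. */
int collides(unsigned x, unsigned y, unsigned n)
{
    return hash_mod(x, n) == hash_mod(y, n);
}
```

With n = 10, the values 1, 11, 21 and 10001 all hash to 1, so any two of them
collide, while 21 and 32 do not.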

This is what the mash function is for, a magnitude hash. A mash function is
of the same form as a regular hash function, only it uses div rather than mod, giving
(x div n). The div operator applied to a value of the form $c \cdot x + r$ gives the value of c,
where x is the base (usually decimal, base 10). The mash function maps values into a
set where the values mash to the same result, based upon the magnitude. So the mash
function shares the same problem as a regular hash, in that all values are mapped into an
equivalent set. So a set of the form
$\{c_1 \cdot x + r_1,\; c_1 \cdot x + r_2,\; \ldots,\; c_1 \cdot x + r_{n-1},\; c_1 \cdot x + r_n\}$ has
all values mashed to the same result. With n = 10, c = 3 the following set of values
are equivalent under the mash function: {30, 31, 32, 33, 34, 35, 36, 37, 38, 39}. With the
mash function, some partitioning of the values is obtained, but there is no distinction
among the values within the same mash result.

Together, however, a hash function, and mash function can both distinguish val- 
ues, by magnitude, and by residue. The minimal form of this association between the 
two functions is an ordinal pair of the form (c, r) where c is the multiple of the base 
obtained with the mash function, and r is the remainder of the value obtained with the 
hash function. Essentially an ordinal pair form a unique representation to the value, 
using the magnitude constant, and residue. Further, each pair distinguishes larger 
and smaller values, using the mash result, and equal magnitudes are distinguished 
from each other using the residue. So the mapping is a perfect hash, as all values 
are uniquely mapped to a result, and is ordering, since the magnitude of the values 
is preserved in the mapping. A formal proof of this property, an injective mapping 
from one set to a resulting set, is given. 
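
The pairing of mash and hash results can be sketched in C as follows. This is
an illustration under stated assumptions, not the paper's implementation: the
struct pair and the names to_pair and pair_cmp are hypothetical, and values are
assumed to be non-negative. Comparing pairs by magnitude first and residue
second reproduces the ordering of the original values.

```c
#include <assert.h>

/* Ordinal pair (c, r): c from the mash function, r from the hash function. */
typedef struct {
    unsigned c;   /* magnitude: x div n */
    unsigned r;   /* residue:   x mod n */
} pair;

/* Map x to its ordinal pair using the same mapping value n for both parts. */
pair to_pair(unsigned x, unsigned n)
{
    assert(n > 0);
    pair p = { x / n, x % n };
    return p;
}

/* Compare two pairs: magnitude first, then residue.
   Returns a negative, zero, or positive result. */
int pair_cmp(pair a, pair b)
{
    if (a.c != b.c)
        return a.c < b.c ? -1 : 1;
    if (a.r != b.r)
        return a.r < b.r ? -1 : 1;
    return 0;
}
```

For n = 10, the value 37 maps to (3, 7), and the pair for 29 compares below the
pair for 30 because the magnitudes 2 and 3 already differ.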

The values for n in the hash function and mash function are determined by the 
range of values involved. The proof gives the validation of the injective nature of the 
mapping, and the mathematical properties of the parameters used, but no general 
guidelines for determining the parameters. The mathematical proof only indicates 
that the same mapping value must be used in the hash and mash functions. 

Multiple iterations of mapping functions can be used, so multiple mapping values 
can be used to form a set of mapping pairs. The choice of values for the mapping 
values depends on the number of dimensions to be mapped into, and the range of 
values that the mapping pairs can take on. Numerous, smaller mapping values will 
produce large mapping ordinates for the mash function, and small values for the hash 
function. This would be a diverse mix of values, but it depends upon the use of the 
hash sort and what is desired by the user of the algorithm. 

The organization of the data elements in the matrix can be of row-major or 
column-major order. The mapping into a column and row by the hash and mash 
function determines if the matrix is row or column major mapping. A row-major 
order mapping would have the rows mapped by the mash function, and the columns 
mapped by the hash function. A column major order mapping would interchange the 
mash and hash functions for the rows and columns respectively. 
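
The two layouts differ only in which sub-function selects the row. A small
sketch in C, assuming a 10 x 10 matrix and values already shifted into the
range 0..99; the names cell, map_row_major and map_col_major are hypothetical.

```c
#include <assert.h>

#define DIM 10   /* assumed matrix dimension for this illustration */

typedef struct {
    int row;
    int col;
} cell;

/* Row-major mapping: the mash (div) selects the row, the hash (mod) the column. */
cell map_row_major(int x)
{
    assert(x >= 0 && x < DIM * DIM);
    cell c = { x / DIM, x % DIM };
    return c;
}

/* Column-major mapping: the roles of mash and hash are interchanged. */
cell map_col_major(int x)
{
    assert(x >= 0 && x < DIM * DIM);
    cell c = { x % DIM, x / DIM };
    return c;
}
```

The value 37 lands in row 3, column 7 under the row-major mapping and in row 7,
column 3 under the column-major mapping.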
2.2. Construction of the Super-Hash Function.



2.2.1. Method of Construction. The super-hash function consists of a regular
hash function using the mod operator, and a mash function using the div operator.
Together these form ordinal pairs which uniquely map the data element into a square
matrix. The important component of the super-hash function to determine is the
mapping constant $\theta$.

The constant $\theta$ must be calculated from the range of the values that are to
be mapped using the super-hash function. The range determines the dimensionality
or size of the matrix, which is square for this super-hash function.

Given a range R[i, j] where i is the lower bound, and j is the upper bound, then:

1. Compute the length L of the range, where L = (j - i) + 1.

2. Determine the nearest square integer at or above L. Its side is calculated
by:

$$\theta = \lceil \sqrt{L} \rceil$$

The final value computed is the side of the nearest square to L, which is the length
of the range of values. In essence the values are being constructed in the form of a
number which is tailored to the length of the range of values.

$$\mathrm{value}(d_x, m_x) = d_x \cdot \theta + m_x$$

where $\theta$ is determined by the range of the values to be mapped.
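
The two construction steps can be sketched in C using integer arithmetic only,
which avoids rounding concerns with a floating-point square root. The function
names ceil_sqrt and mapping_constant are hypothetical, and the sketch assumes
the range length fits in an unsigned int.

```c
#include <assert.h>

/* Smallest t such that t * t >= len, i.e. the ceiling of sqrt(len). */
unsigned ceil_sqrt(unsigned len)
{
    unsigned t = 0;
    while ((unsigned long long)t * t < len)
        t++;
    return t;
}

/* Mapping constant for the range [lo, hi]:
   step 1 computes the length L, step 2 takes the ceiling of its square root. */
unsigned mapping_constant(int lo, int hi)
{
    assert(lo <= hi);
    /* Unsigned subtraction avoids signed overflow for wide ranges. */
    unsigned len = (unsigned)hi - (unsigned)lo + 1u;
    return ceil_sqrt(len);
}
```

The linear search in ceil_sqrt takes about sqrt(L) steps, which is negligible
next to the sort itself; it was chosen here for clarity rather than speed.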

2.2.2. Example of Constructing a Super-Hash Function. As an example,
suppose you have a range of values from 13 to 123. The length of the range is
123 - 13 + 1, which is 111.

The nearest square is then calculated as follows:

$$\theta = \lceil \sqrt{111} \rceil$$

which evaluates as:

$$\theta = \lceil 10.53565375\ldots \rceil = 11$$

Part of the mapping involves subtracting the lower bound of the range, so that
all the values are initially mapped to the range 0..(j - i). So the super-hash function
for this range of data values is:

$$F(x) = \begin{cases} d = (x - 13) \;\mathrm{div}\; 11 \\ m = (x - 13) \;\mathrm{mod}\; 11 \end{cases}$$



For the lowest value, 13 maps to (0,0). The largest value 123 maps to (10,0).
Reconstructing the values from the ordinal pairs is of the form:

$$\mathrm{value}(d_x, m_x) = d_x \cdot \theta + m_x + i$$

so that for (0,0)

$$\mathrm{value}(0, 0) = (0 \cdot 11 + 0) + 13 = 13$$

and for (10,0)

$$\mathrm{value}(10, 0) = (10 \cdot 11 + 0) + 13 = 123$$
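
The forward mapping and its reconstruction for this example can be sketched in
C. The struct ordinal_pair and the names super_hash and value_of are
hypothetical; the lower bound and the mapping constant are passed as parameters
so that the example values 13 and 11 are not hard-coded.

```c
#include <assert.h>

/* Ordinal pair produced by the super-hash function. */
typedef struct {
    int d;   /* mash result: (x - lo) div theta */
    int m;   /* hash result: (x - lo) mod theta */
} ordinal_pair;

/* Forward mapping: shift by the lower bound, then apply div and mod. */
ordinal_pair super_hash(int x, int lo, int theta)
{
    assert(theta > 0 && x >= lo);
    ordinal_pair p;
    p.d = (x - lo) / theta;
    p.m = (x - lo) % theta;
    return p;
}

/* Inverse mapping: rebuild the original value from its ordinal pair. */
int value_of(ordinal_pair p, int lo, int theta)
{
    return p.d * theta + p.m + lo;
}
```

Because the inverse recovers every value in the range, no two values in the
range can share an ordinal pair, which is the injective property the text
relies on.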

One final point is that constructing a super-hash function requires knowledge of 
the range of possible values of the data elements. All data types in computer languages 
have a specific range of values for the different types, such as int, float, unsigned int, 
long to list types from the C programming language. While these types are often 
used without needing to know the range of values, there does exist such a property 
on those values and variables defined of that particular type in a program. 

2.3. In-situ Hash Sort Algorithm.

The hash sort algorithm uses the super-hash function iteratively on an entire data
set within the range of the super-hash function. This is the in-situ version of the hash
sort, which works "in site", that is, in place. Before the iterative process, an
initialization is performed. A source value is retrieved, and is mapped by the super-hash
function to another location. At the destination location, the destination value is
exchanged with the source one, which is stored at that location. The displaced value
now becomes the source value. The algorithm then repeats iteratively for each source
and destination value. The algorithm terminates at the end of the list.

Pseudo-code illustrating a very generalized form of the in-situ hash sort is:

(m1, m2, ..., mn-1, mn) <- initialize;

WHILE NOT ( end_of_list ) DO

    temp <- get(m1, m2, ..., mn-1, mn);
    value -> put(m1, m2, ..., mn-1, mn);
    value = temp;

    (m1, m2, ..., mn-1, mn) -> super_hash(temp);
END WHILE;

Pascal Version:

(* data_list is assumed to be declared as array [0..D-1, 0..D-1] of integer *)
Procedure hash_sort (var list: data_list; n: integer);
Const
    D = 10; (* D is the dimension size for a 10 x 10 matrix *)
Var
    x_c, h_c: integer;
    which, where: integer;
    value: integer;
Begin
    x_c := 0; (* set the counts to zero *)
    h_c := 0;

    where := 0; (* set the initial starting point to zero *)
    while ( (x_c < n) AND (h_c < n) ) do
    begin (* loop until exchange or hysteresis count equals data size *)
        value := list[where div D, where mod D]; (* get a value *)
        if (value = where) then (* check for hysteresis *)
        begin
            where := where + 1; (* on hysteresis move where to next position *)
            h_c := h_c + 1; (* on hysteresis increment hysteresis count *)
        end
        else (* if no hysteresis, swap values and increment exchange count *)
        begin
            list[where div D, where mod D] := list[value div D, value mod D];
            list[value div D, value mod D] := value;
            x_c := x_c + 1;
        end;
    end;
End;

C version:

#define DIM 10 /* dimension size for the matrix */

void hash_sort(int list[DIM][DIM], int n) {
    int x_c = 0, h_c = 0; /* exchange count, hysteresis count */
    int which = 0, where = 0;
    int value;

    while ((x_c < n) && (h_c < n)) {
        value = list[where / DIM][where % DIM];

        if (value == where) {
            where++; /* on hysteresis move where to next position */
            h_c++;   /* on hysteresis increment hysteresis count */
        } else {
            /* if no hysteresis, swap values and increment exchange count */
            list[where / DIM][where % DIM] = list[value / DIM][value % DIM];
            list[value / DIM][value % DIM] = value;
            x_c++;   /* increment the exchange count */
        }
    }
}

Generic pseudo-code of the operations involved in the in-situ version of the algorithm:

PROCEDURE HASH_SORT;
    Initialize variables;

    WHILE ( Counts < Data Size ) DO
        Mathematically process value;
        Compute destination in array;

        IF hysteresis THEN
            Move where to next location;
            Increment hysteresis count;
        ELSE
            Exchange value with destination;
            Increment exchange count;
        END IF;
    END WHILE;

END PROCEDURE;

The hash sort is a simple algorithm; the complexity is in the super-hash function,
which is the key to the algorithm. The remaining components of the algorithm handle
the exchange of values, and continuous iteration until a termination condition.

The hash sort is a linear time complexity sorting algorithm. This linear time 
complexity stems from the mapping nature of the algorithm. The hash sort uses 
the intrinsic nature of the values to map them in order. The mapping function is 
applied repeatedly in an iterative manner until the entire list of values is ordered. 
The mathematical expression for the run-time complexity is: 

$$F(\mathrm{time}) = 2 \cdot c \cdot N$$

where $c \ge 1$, and N is the size of the list of data values.

The function for the run-time complexity is derived from the complexity of the
mapping function, and the iterative nature of the algorithm. The mapping function is
a composite function, with two sub-functions. Multiple applications of the mapping
function are possible because of the extendible nature of the hash sort, but at least
one mapping is required. Hence, the constant c is a positive integer value, which
represents the number of sub-mappings within the mapping function.

The mapping function uses two sub-functions as a composite, so the overall time
for the mapping is two multiplied by the number of sub-mappings. This product is
always greater than one, and it reflects the dimension of the hash sort. The constant c
is not dependent on the data values or the size of the data values; it is an
implementation constant, and once chosen it is unaltered.

The mapping function must be applied iteratively to the range of values forming 
the data list. This makes the overall time complexity of the hash sort the product of 
the complexity of the mapping function multiplied by the size of the range of the list 
of values. The hash sort remains multiple dimension, and linear in time complexity. 
The time complexity of the hash sort is then: 
$$F(\mathrm{time}) = O(N)$$
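
One way to see the linear bound for the in-situ loop is to count its iterations
directly. The following is a supplementary argument, not taken from the paper,
and it assumes the stated restrictions hold: N unique values that exactly fill
the range. The symbols X (number of exchanges), H (number of hysteresis steps)
and T (total iterations) are introduced here for illustration. Each exchange
places one value in its final location, where it is never moved again, and the
last misplaced value falls into place together with the one before it. Each
hysteresis step advances the current location by one, and the loop ends when the
current location passes the last element:

$$X \le N - 1, \qquad H = N, \qquad T(N) = X + H \le 2N - 1 = O(N)$$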



The space complexity of the hash sort considers the storage of the data
values, and not the variables used in the mapping process. The space complexity
of the hash sort depends upon which variant of the hash sort is used. The in-situ
hash sort, which uses the same multi-dimensional data structure as it maps the values
from the initial location to the final location, requires (N + 1) storage space. The data
structure is the size of the range of values, which is N. One additional storage space
is required as a temporary storage location as values are exchanged when mapped.
The space complexity for the in-situ hash sort is:

$$F(\mathrm{space}) = O(N)$$

The direct hash sort maps from a one-dimensional data structure to the
multi-dimensional data structure of the same size. No temporary storage is required, as no
values are exchanged in the mapping process; the values are directly mapped to the
final location within the multi-dimensional data structure. The two data structures
are organized differently, but are of the same size N. Thus, the direct hash sort
requires 2 · N in terms of space complexity. The space complexity for the direct hash
sort is:

$$F(\mathrm{space}) = O(N)$$

Each variation of the hash sort has different properties, but the space complexity
is (N + 1) for the in-situ hash sort, and 2 · N for the direct hash sort. For each
variation, the space complexity is linearly related to the size of the list of data values.

The hash sort asymptotically outperforms conventional sorting algorithms (such as
quick sort), which have N log N time complexity. This is readily apparent from a
simple inequality involving the ratio of the two algorithms. As the size of the data N
increases without bound, the ratio between the hash sort and a quick sort should be
less than one.

If the ratio is greater than one, the time complexity of the hash sort is greater
than that of the quick sort. If the ratio is exactly one, then the two sort algorithms
perform with the same complexity. A ratio of less than one indicates that the time
complexity of the hash sort is less than that of the quick sort, therefore outperforming
it. Writing k for the implementation constant, the ratio is:

$$\lim_{N \to +\infty} \frac{2 \cdot k \cdot N}{N \cdot \log N} < 1.0$$

which simplifies to:

$$\lim_{N \to +\infty} \frac{2 \cdot k}{\log N} < 1.0$$

Taking the limit as N tends to +∞, the denominator grows without bound and the
ratio becomes:

$$\frac{2 \cdot k}{\infty} < 1.0$$

which then reduces to:

$$0.0 < 1.0$$

showing that the ratio of the two algorithms is indeed less than one.

Therefore this means the hash sort will asymptotically outperform the quick sort. 

2.4. Variations of the Hash Sort. There are two versions of the
hash sort, the in-situ hash sort, and the direct hash sort. Variations upon the hash
sort build upon these two primary types. The in-situ hash sort is the basic form of the
hash sort, which works by exchanging values as explained previously. The in-situ hash
sort has a problem which can increase its time complexity, although the algorithm
remains linear: data values that map to their current location. In essence, such a data
value is already in site, where it belongs. Since the in-situ hash sort determines where
a value belongs and then exchanges it, this would cause the in-situ hash sort to halt.
To remedy this, another iterative mechanism keeps the in-situ hash sort algorithm
going by moving the current location to the next one. When the current location and
destination location are the same, the in-situ hash sort has to be forced to proceed.

This forcing of the in-situ hash sort does add more time complexity, but as a
linear multiplied constant. The number of data elements that map back to the
current location they are at is the amount of hysteresis present in the data. The term
hysteresis is borrowed from electrical theory, meaning to lag; hysteresis in the in-situ
hash sort causes a lag, which the algorithm must be forced out of. The worst case
for hysteresis is that all of the data elements map to the location they are at. In this
case, the in-situ hash sort would have to be pushed along until it is through the data
elements. In this worst case, the time of the algorithm doubles, increasing linearly to
2(2 · k) · N, or 4 · k · N. The in-situ hash sort is more space efficient, using only the
space to store the data elements, and a temporary storage location for sorting the
data elements, or N + 1.
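
Hysteresis can be observed by counting the two kinds of step separately. The
sketch below is a simplified variant of the in-situ loop, not the paper's exact
listing: it assumes a 4 x 4 matrix holding each value 0..15 exactly once, it
terminates when the current location passes the last cell, and the names
sort_counts and in_situ_hash_sort are hypothetical.

```c
#include <assert.h>

#define DIM 4            /* small matrix, assumed for illustration */
#define N (DIM * DIM)    /* number of cells and of data values */

typedef struct {
    int exchanges;       /* steps that moved a value to its final cell */
    int hysteresis;      /* steps where the value was already in place */
} sort_counts;

/* In-situ hash sort of a matrix holding a permutation of 0..N-1.
   Duplicate values are not supported and would prevent termination. */
sort_counts in_situ_hash_sort(int m[DIM][DIM])
{
    sort_counts k = { 0, 0 };
    int where = 0;

    while (where < N) {
        int value = m[where / DIM][where % DIM];
        assert(value >= 0 && value < N);

        if (value == where) {
            /* hysteresis: force the algorithm on to the next location */
            where++;
            k.hysteresis++;
        } else {
            /* exchange: value goes to its own cell, displaced value comes here */
            m[where / DIM][where % DIM] = m[value / DIM][value % DIM];
            m[value / DIM][value % DIM] = value;
            k.exchanges++;
        }
    }
    return k;
}
```

On an already sorted matrix every step is a hysteresis step and no exchange
occurs, which is the worst case for hysteresis described above. On a fully
reversed 4 x 4 matrix the sketch performs 8 exchanges and 16 hysteresis steps.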

The direct hash sort is a variation upon the in-situ hash sort that avoids the problem
with hysteresis. The direct hash sort uses a temporary data array to hold the data
elements. The data elements are then mapped from the single dimension array into the
multiple dimension array. No element can map to its current location, as it is
being mapped from a one-dimensional array into a multi-dimensional one. However, the
direct hash sort requires twice the storage, or 2 · N. The tradeoff for time efficiency
is a worsening of the space efficiency. The time complexity is 2 · k · N, as the problem
with hysteresis never surfaces.

The variations in the hash sort can be applied to the two primary forms, the
in-situ and the direct hash sort. The variations are in dimensionality, and in relaxing
a restriction imposed upon the data set that is sorted. The version of the hash sort
mentioned is two-dimensional, but the hash sort can be of any dimension d, where
$d \ge 2$. The map by hashing would then be multiple applications of the hash scheme.
For a d-dimensional hash sort, the time will be of the form 2 · k · N, where k is the
dimensionality of the hashing structure. Note that k is an implementation constant, so
once decided upon, the factor remains a constant multiple of 2. Either primary type of
hash sort algorithm, the in-situ hash sort, or the direct hash sort, can be extended
into higher dimensionality.

The other variation upon the primary forms of the hash sort concerns the restric- 
tion of unique data values. This restriction upon the data set can be relaxed, but with 
each location in the hash structure a count is required of the number of data elements 
that hash to that location. For large amounts of values in the data set, the hash 
sort will "compress" them into the hash structure. When the original sorted data set 
is required, each data element would be enumerated by the number of hashes to its 
location. The flexibility of the hash sort to accommodate non-unique data values is 
inefficient if there are few repeated data values. In such a case, nearly half of the hash 
structure will be used only for single data values. 

Both the in-situ hash sort and the direct hash sort have another problem, which
is inherent in either variant of the hash sort algorithm. If not all of the data values
within the range are present, then the data set is sparse. The range is used to determine
the hash structure size, and the time to run the algorithm. The hash sort presumes
all data values in the data set within the range are present. If not, then the hash
sort will still proceed sorting for the range size, not the data set size. So for the data
set size $N_d$ and the range size $N_r$, if $N_d \ge N_r$, the hash sort algorithm performs
as expected, or better.

If $N_d < N_r$, then the sparsity problem surfaces. The hash sort will sort on a
range of values, of which some are non-existent. The smaller ratio of data set size to
range size will not be reflected in the time complexity of the hash sort, or in the required
hash structure; so when the cardinality of the data set and the cardinality of the range of
values within the data set are inconsistent, the hash sort does not fail, but it becomes
inefficient. The empty spaces within the hash structure then must hold some sentinel
value outside the data range to distinguish them from actual data values. The direct
hash sort alleviates the sparsity problem somewhat for the hash sort time, but still is
inefficient in the hash structure as some of the locations will be blank or empty.

Code for the direct hash sort algorithm is:

C version:

#define DIM 10 /* dimension size of the matrix */

typedef struct tag {
    int count;
    int value;
} element;

element M[DIM][DIM]; /* file scope, so every count starts at zero */

void hash_sort(int L[], element M[][DIM], int size)
{
    int value, x;
    int row, col;

    for (x = 0; x < size; x++)
    {
        /* get the value and determine its row, column location */
        value = L[x];
        row = value / DIM; /* mash selects the row */
        col = value % DIM; /* hash selects the column */

        if (M[row][col].count > 0 && M[row][col].value == value)
            /* if the value is already here, increment count */
            M[row][col].count++;
        else {
            /* store the value, initialize the count to 1 */
            M[row][col].count = 1;
            M[row][col].value = value;
        }
    }
}

Pascal version:

const
    d = 10;
    size = 100;

type
    tag = record
        count: integer;
        value: integer;
    end;

    matrix = array [0..d-1, 0..d-1] of tag;
    list = array [1..size] of integer;

(* M is assumed to arrive with every count set to zero *)
procedure hash_sort (var M: matrix; L: list; n: integer);
var
    value: integer;
    row, col: integer;
    x: integer;
begin
    (* go through list and map values into matrix *)
    for x := 1 to n do begin
        (* get current value, determine row and column location *)
        value := L[x];
        row := value div d;
        col := value mod d;

        if (M[row, col].count > 0) and (M[row, col].value = value) then
            (* if value already here *)
            M[row, col].count := M[row, col].count + 1
        else begin
            (* if maps first time, store value, initialize count to 1 *)
            M[row, col].count := 1;
            M[row, col].value := value;
        end; (*if*)
    end; (*for*)
end; (*hash_sort*)
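
Once the values are mapped, the sorted sequence is recovered by enumerating the
matrix in row-major order and emitting each value as many times as its count. A
minimal sketch in C of both steps follows, assuming values in the range 0..99
and a 10 x 10 matrix; the names direct_hash_sort and read_back are hypothetical,
and a count of zero plays the role of the sentinel for empty cells.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 10   /* assumed matrix dimension for this illustration */

typedef struct {
    int count;   /* how many list entries mapped to this cell */
    int value;   /* the value stored in this cell */
} element;

/* Map every list entry into the matrix, counting repeated values. */
void direct_hash_sort(const int *list, size_t size, element m[DIM][DIM])
{
    /* Clear the counts first, so that empty cells are recognizable. */
    for (int r = 0; r < DIM; r++)
        for (int c = 0; c < DIM; c++)
            m[r][c].count = 0;

    for (size_t x = 0; x < size; x++) {
        int value = list[x];
        assert(value >= 0 && value < DIM * DIM);
        element *e = &m[value / DIM][value % DIM];
        e->value = value;
        e->count++;
    }
}

/* Enumerate the matrix in row-major order into out, which must have room
   for every list entry. Returns the number of values written. */
size_t read_back(element m[DIM][DIM], int *out)
{
    size_t k = 0;
    for (int r = 0; r < DIM; r++)
        for (int c = 0; c < DIM; c++)
            for (int i = 0; i < m[r][c].count; i++)
                out[k++] = m[r][c].value;
    return k;
}
```

Note that read_back always visits all DIM x DIM cells, so for a sparse data set
the enumeration cost follows the range size rather than the list size.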



2.5. Differences between Hash Sort Variants. 

2.5.1. Distinction of Concepts. There are three distinct concepts involved in 
the hash sort which need to be distinguished from one another. Each has an important 
significance relating to the hash sort, and is explained as it is identified. 

These three concepts are: 

1. Number of data elements N 

2. Range of the data values from a lower to upper bound R 

3. Square matrix which is the data structure M 

The number of data elements N is the total data size to be mapped by the hash 
sort. The number of data elements determines the time-complexity of the hash sort, 
the amount of time being linearly proportional to the number of data elements. 

For the in-situ hash sort, the number of data elements is less than or equal to the
size of the square matrix M. For the direct hash sort, the number of elements can be
of any size; the only requirement is that the data elements fall within the range R.

The square matrix M is constructed around the range R of the data values so that
all possible values within the range R do map. The super-hash function maps data
elements within the range R into the square matrix M. Therefore, the storage space
M is formed from the range R of the data values, not the number of data elements
N.
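
As a rough illustration of how the matrix follows from the range rather than
from the number of elements, the sketch below picks the smallest dimension
whose square covers the range and then applies the div/mod mapping. Choosing
the smallest such dimension is an assumption of this sketch (it agrees with
the 3x3 matrix for 1 .. 9 and the 2x2 matrix for 7 .. 10 used in the
walk-through), and the names matrix_dim, super_hash and location are
illustrative only.

```c
#include <assert.h>

/* Smallest dimension dim such that a dim x dim matrix can hold every
   value in lo .. hi (hypothetical helper for this sketch). */
int matrix_dim(int lo, int hi)
{
    assert(lo <= hi);
    int range = hi - lo + 1;    /* R: number of distinct possible values */
    int dim = 1;

    while (dim * dim < range)
        dim++;
    return dim;
}

typedef struct {
    int d;                      /* first index, from div */
    int m;                      /* second index, from mod */
} location;

/* Map a value x in lo .. hi to its (d, m) location in the matrix. */
location super_hash(int x, int lo, int dim)
{
    location loc;

    assert(x >= lo && x - lo < dim * dim);
    loc.d = (x - lo) / dim;
    loc.m = (x - lo) % dim;
    return loc;
}
```

Note that neither function takes the number of data elements as an argument;
only the bounds of the range enter the calculation.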

The time complexity is dependent upon the number of data elements, the size N.
The in-situ hash sort can have fewer elements N than the matrix capacity M, but the
hash sort algorithm must go through the entire matrix M. Hence sparsity of data
values within the matrix is highly inefficient. The direct hash sort is different since
the mapping is from a list into the matrix M. The time complexity is again linear in
the size N of the list of elements.

In both cases there is linear time complexity, only with the in-situ variant it is
dependent more upon the size of the matrix M than the amount of data in the matrix.
In doing so, the time complexity is more dependent upon the range of values for the
in-situ variant than the number of them. No more data elements can be stored in the
matrix than its size M permits, which is related to the range R.

2.5.2. Analysis of Hash Sort Variants. Depending upon the variant used,
the hash sort can have a time complexity linearly proportional to either the size of the
data structure, which is a matrix M, or the size of the data list L. In each variant of
the hash sort, the linear nature of the algorithm is proportionate to a greatest lower
bound.

The in-situ variant of the hash sort is linearly proportional to the size of the data
structure, the matrix M. The matrix M has a size determined in part by the super-hash
function. The size of the matrix places a least upper bound on the possible size
of the list L, and in doing so forms a least upper bound of O(M) time complexity.
This least upper bound constraint stems from the fact that the data list L and the
data structure the matrix M are the same entity.

The direct variant of the hash sort is linearly proportional to the size of the data
list L, as the data structure the matrix M and the list L are separate entities. The
time complexity is independent of the data structure in the direct hash sort. This
independence of entities permits a greatest lower bound dependent on the size of the
data list L and not the data structure the matrix M.

For each variant of the hash sort, the time complexity of the algorithm is linear.
The linearity is constrained in relation to two separate determining factors for each
variant of the hash sort. In one variant, the in-situ version of the hash sort, the time
complexity has a least upper bound determined by the matrix M. The size of the
data list L in M can be less than, but no more than, the size of the matrix M because
the data list L and the data structure M are the same entity. The direct hash sort
variant separates these two entities, and in doing so the constraint is a greatest lower
bound dependent upon the size of the data list L.



A table summarizing these distinctions is given below: 



Variant    Big-Oh    Constraint             Comment
-------    ------    -------------------    -----------
In-situ    O(M)      Data Structure Size    O(M) < O(N)
Direct     O(N)      Data List Size         < O(N)



2.5.3. Example Walk-through of Hash Sort Variants. 
In-situ hash sort example with 2-dimensional 3x3 matrix

Matrix:

  m =   0    1    2
  d
  0     *    *    *
  1     *    *    *
  2     *    *    *



The values in the matrix must be mapped to the range 0 .. 8; this range of values
forms an ordinal pair of the form (d,m) from (0,0) to (2,2). Values in the matrix
increase from left to right m = 0 .. 2, and top to bottom d = 0 .. 2. This is how the
ordering should place values once they are sorted.

Any range of values can be used, but for simplicity, the range of values in the
matrix is from 1 to 9. The hash sort algorithm used will be in-situ (in place), so the
problem of hysteresis is present. Hysteresis is the occurrence where a value is in its
correct location, so the algorithm maps it back on to itself.

The hash sort will involve starting at some initial point in the matrix, then
traversing the matrix and mapping each value to its correct location. As each value is
mapped, order is preserved (a larger value will map to a "higher" position than a
smaller one), and there are no collisions (excepting self-collisions, which are
hysteresis). The values are mapped by subtracting one, then applying the mod and div
operators.

The value currently being handled is shown in parentheses, and its computed
destination is shown in brackets. The value used in the computation is given, along
with the computed ordinal pair. A before and after illustration is given to show the
exchange of the two values, which is computed and then done by the hash sort.

The super-hash function for this data set is: 

d = (x - 1) div 3 ; m = (x - 1) mod 3

  m =   0    1    2
  d
  0     5    8    1
  1     9    7    2
  2     4    6    3

Matrix Initial Configuration



  m =   0    1    2          m =   0    1    2
  d                          d
  0    (5)   8    1          0    (7)   8    1
  1     9   [7]   2          1     9    5    2
  2     4    6    3          2     4    6    3

       Before                     After



The start position is initially at (0,0). The value is 5, subtracting 1 is 4. The mapping
(illustrated once) is d = (4 div 3) = 1, m = (4 mod 3) = 1. Thus the computed
destination for where the value goes is (1,1).

  m =   0    1    2          m =   0    1    2
  d                          d
  0    (7)   8    1          0    (4)   8    1
  1     9    5    2          1     9    5    2
  2    [4]   6    3          2     7    6    3

       Before                     After

The next value is 7, subtracting 1 is 6. The computed destination for where the value 
goes is (2,0). 

  m =   0    1    2          m =   0    1    2
  d                          d
  0    (4)   8    1          0    (9)   8    1
  1    [9]   5    2          1     4    5    2
  2     7    6    3          2     7    6    3

       Before                     After

The next value is 4, subtracting 1 is 3. The computed destination for where the value 
goes is (1,0). 

  m =   0    1    2          m =   0    1    2
  d                          d
  0    (9)   8    1          0    (3)   8    1
  1     4    5    2          1     4    5    2
  2     7    6   [3]         2     7    6    9

       Before                     After

The next value is 9, subtracting 1 is 8. The computed destination for where the value 
goes is (2,2). 

  m =   0    1    2          m =   0    1    2
  d                          d
  0    (3)   8   [1]         0    (1)   8    3
  1     4    5    2          1     4    5    2
  2     7    6    9          2     7    6    9

       Before                     After

The next value is 3, subtracting 1 is 2. The computed destination for where the value 
goes is (0,2). 

  m =   0    1    2          m =   0    1    2
  d                          d
  0    <1>   8    3          0     1   (8)   3
  1     4    5    2          1     4    5    2
  2     7    6    9          2     7    6    9

       Before                     After

The next value is 1, subtracting 1 is 0. The computed destination for where the value 
goes is (0,0). 

Here is an example of hysteresis, noted by the angular brackets; the start and final
positions are equal, so the value has mapped back onto itself. The start position is
'forced' to the next position, or the algorithm would be stuck in an infinite loop and
so would "stall".



  m =   0    1    2          m =   0    1    2
  d                          d
  0     1   (8)   3          0     1   (6)   3
  1     4    5    2          1     4    5    2
  2     7   [6]   9          2     7    8    9

       Before                     After

The next value is 8, subtracting 1 is 7. The computed destination for where the value 
goes is (2,1). 

  m =   0    1    2          m =   0    1    2
  d                          d
  0     1   (6)   3          0     1   (2)   3
  1     4    5   [2]         1     4    5    6
  2     7    8    9          2     7    8    9

       Before                     After

The next value is 6, subtracting 1 is 5. The computed destination for where the value 
goes is (1,2). 

  m =   0    1    2          m =   0    1    2
  d                          d
  0     1   <2>   3          0     1    2    3
  1     4    5    6          1     4    5    6
  2     7    8    9          2     7    8    9

       Before                     After

The next value is 2, subtracting 1 is 1. The computed destination for where the value 
goes is (0,1). 

Here is another example of hysteresis; the start and final positions are equal, so the
value has mapped back onto itself. The start position is 'forced' to the next position,
or the algorithm would remain stuck.

The matrix is now sorted, but the algorithm has no way of knowing this. It will
continue from 3 to 9 until the hysteresis count equals the size of the data N. The
remaining sorted data has caused "hysteresis" by causing the algorithm to lag in a
sense, getting tripped up over already sorted data. In an ideal arrangement of the
data, no hysteresis would occur; the algorithm would then know it has finished sorting
when the exchange count equals the size of the data N. Unfortunately, this is an ideal
situation which most likely will never occur in practice.
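
The walk-through above can be sketched in C as follows. This is a minimal
illustration, not the original implementation: it assumes the matrix holds
each value of the range exactly once (as in the example), it terminates when
every position has produced a hysteresis event, and the names
in_situ_hash_sort, LOW, exchanges and hysteresis are illustrative.

```c
#include <assert.h>

#define DIM 3                   /* 3 x 3 matrix */
#define LOW 1                   /* lower bound of the value range 1 .. 9 */

/* In-situ hash sort of a DIM x DIM matrix holding each value
   LOW .. LOW + DIM*DIM - 1 exactly once (an assumption of this sketch;
   duplicate values would make the loop spin forever).
   The two counters are returned through the pointers. */
void in_situ_hash_sort(int m[DIM][DIM], int *exchanges, int *hysteresis)
{
    int pos = 0;                /* current start position, row by row */

    *exchanges = 0;
    *hysteresis = 0;

    while (pos < DIM * DIM) {
        int sd = pos / DIM;     /* start position (sd, sm) */
        int sm = pos % DIM;
        int value = m[sd][sm];

        assert(value >= LOW && value < LOW + DIM * DIM);

        int d = (value - LOW) / DIM;    /* computed destination (d, k) */
        int k = (value - LOW) % DIM;

        if (d == sd && k == sm) {
            /* hysteresis: the value maps onto itself, so the start
               position is forced forward */
            (*hysteresis)++;
            pos++;
        } else {
            /* exchange the value with the occupant of its destination;
               the start position does not move */
            m[sd][sm] = m[d][k];
            m[d][k] = value;
            (*exchanges)++;
        }
    }
}
```

Run on the initial configuration of the walk-through, this sketch performs
seven exchanges and records nine hysteresis events, one per matrix position.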

Direct Hash Sort Example with 2x2 Matrix

Matrix

  m =     0        1
  d
  0     <*,*>    <*,*>
  1     <*,*>    <*,*>

The values of the data set are from 7 to 10. A list L of size 8 of data elements to
be mapped will be used. Each location in the matrix holds a value and an associated
count of the number of times that value has been mapped to that location. Once all the
values from the list L are mapped, the hash sort algorithm will terminate.

The hash function is: 

d = (x - 7) div 2 ; m = (x - 7) mod 2 

As each value is mapped, a brief explanation of the process and what happens to
the matrix will be given. As the list L is mapped, it will become progressively smaller
in representation through the walk-through. Once the list size is zero, the hash sort
will be complete. The left-most element in the list L will be the one being mapped
by the hash sort algorithm.

L = { 7, 8, 7, 9, 10, 7, 8, 8 } 



Matrix

  m =     0        1
  d
  0     <*,*>    <*,*>
  1     <*,*>    <*,*>

Matrix Initial Configuration
Step 0

The first value is 7. The hash sort maps the value to the location in the matrix
as d = (7-7) div 2, m = (7-7) mod 2, which is (0,0).

L = { 8, 7, 9, 10, 7, 8, 8 }

Matrix

  m =     0        1
  d
  0     <7,1>    <*,*>
  1     <*,*>    <*,*>

Step 1

The next value is 8. The hash sort maps the value to the location in the matrix 
d = (8-7) div 2, m = (8-7) mod 2, which is (0,1). 

L = { 7, 9, 10, 7, 8, 8 } 

Matrix

  m =     0        1
  d
  0     <7,1>    <8,1>
  1     <*,*>    <*,*>

Step 2

The next value is 7. The hash sort maps the value to the location in the matrix 
d = (7-7) div 2, m = (7-7) mod 2, which is (0,0). 

L = { 9, 10, 7, 8, 8 } 

Matrix

  m =     0        1
  d
  0     <7,2>    <8,1>
  1     <*,*>    <*,*>

Step 3

The next value is 9. The hash sort maps the value to the location in the matrix
d = (9-7) div 2, m = (9-7) mod 2, which is (1,0).

L = { 10, 7, 8, 8 }

Matrix

  m =     0        1
  d
  0     <7,2>    <8,1>
  1     <9,1>    <*,*>

Step 4

The next value is 10. The hash sort maps the value to the location in the matrix 
d = (10-7) div 2, m = (10-7) mod 2, which is (1,1). 



L = { 7, 8, 8 } 

Matrix

  m =     0        1
  d
  0     <7,2>    <8,1>
  1     <9,1>    <10,1>

Step 5

The next value is 7. The hash sort maps the value to the location in the matrix 
d = (7-7) div 2, m = (7-7) mod 2, which is (0,0). 



L = { 8, 8 } 

Matrix

  m =     0        1
  d
  0     <7,3>    <8,1>
  1     <9,1>    <10,1>

Step 6

The next value is 8. The hash sort maps the value to the location in the matrix
d = (8-7) div 2, m = (8-7) mod 2, which is (0,1).

L = { 8 }

Matrix

  m =     0        1
  d
  0     <7,3>    <8,2>
  1     <9,1>    <10,1>

Step 7

The next value is 8. The hash sort maps the value to the location in the matrix
as d = (8-7) div 2, m = (8-7) mod 2, which is (0,1).

L = { }

Matrix

  m =     0        1
  d
  0     <7,3>    <8,3>
  1     <9,1>    <10,1>

Matrix Final Configuration

All 8 data elements have been mapped by the hash sort into the matrix. There
is no hysteresis with the direct hash sort, as all data elements are mapped by value
into the appropriate location. Moreover, the time of the direct hash sort is linearly
proportional to the size of the data list L.
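
The direct walk-through can be reproduced with a short C sketch. It is a
hedged illustration under the example's assumptions (values 7 .. 10, a 2x2
matrix, the hash function d = (x - 7) div 2, m = (x - 7) mod 2); the names
cell, LOW and direct_hash_sort are illustrative and do not come from the
original code.

```c
#include <assert.h>

#define DIM 2                   /* 2 x 2 matrix */
#define LOW 7                   /* values lie in 7 .. 10 */

typedef struct {
    int value;
    int count;                  /* 0 marks an empty <*,*> cell */
} cell;

/* Map each list element x to (d, m) = ((x - LOW) div DIM, (x - LOW) mod DIM),
   storing the value and counting repeated occurrences. One pass over the
   list, so the work is proportional to the list size. */
void direct_hash_sort(const int list[], int size, cell m[DIM][DIM])
{
    /* start from an empty matrix */
    for (int d = 0; d < DIM; d++)
        for (int k = 0; k < DIM; k++) {
            m[d][k].value = 0;
            m[d][k].count = 0;
        }

    for (int i = 0; i < size; i++) {
        int x = list[i];

        assert(x >= LOW && x < LOW + DIM * DIM);    /* x must lie in the range */

        cell *c = &m[(x - LOW) / DIM][(x - LOW) % DIM];
        c->value = x;
        c->count++;
    }
}
```

Run on the eight-element list L = { 7, 8, 7, 9, 10, 7, 8, 8 }, the sketch
leaves counts of 3, 3, 1 and 1 for the values 7, 8, 9 and 10.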

2.6. Other Similar Algorithms. 

There are three other algorithms that have very strong similarities to the hash 
sort algorithm. These algorithms are: address calculation sort, bin sort, and the radix 
sort. While similar, these algorithms are distinct from the hash sort. 

2.6.1. Address Calculation Sort:. 

The address calculation sort is very similar to the hash sort. The address
calculation sort is sometimes referred to as sorting by hashing. The address calculation
sort uses a hashing method that is order-preserving, similar to the hash sort.
However, the address calculation sort has the problem that if the data is not uniformly
distributed, then it degenerates into an O(N^2) time complexity. This is the sparsity
problem with the hash sort, but it does not lead to such an extreme degeneration in the
hash sort as it does with the address calculation sort.
Another variant of the address calculation sort is the pigeonhole sort, in which the
data list of elements is subdivided into bins, and then within each bin the sub data
list is sorted.




2.6.2. Bin Sort:. 

Bin sort is similar to hash sort in that data is stored in a "bin" to which it is
mapped. Unlike the hash sort, where only redundant data values map to the same
location, the bin sort can have multiple distinct values mapping to the same bin, so
each bin may hold several different data elements. If there are N elements and M bins,
then the bin sort is linear time O(N + M). However, if the number of bins is N^2,
then the bin sort will degenerate into a worst case of O(N^2).

2.6.3. Radix Sort:. 

The radix sort is similar to the hash sort in that the digits or sub-elements of each
data value are used in the sort. The algorithm uses the digits of the data element
to map it to its unique location. Hash sort does this indirectly, not by each
sub-element, but by mathematical mapping. Radix sort for data elements of size m with
N elements has a time complexity of O(m · N). If the sub-data elements become very
dense, then m approaches log N, and the radix sort degenerates to
an O(N · log N) algorithm. Hence the radix sort depends on m being much less than N
by a sizable ratio.
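
For comparison, a least-significant-digit radix sort can be sketched as
below. This is a common textbook formulation, offered only to make the
O(m · N) cost concrete; it assumes non-negative integers and decimal digits,
and it is not taken from the original paper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Least-significant-digit radix sort of non-negative integers, base 10.
   Each pass is a stable counting sort on one decimal digit, so m digits
   and n elements cost on the order of m * n steps. */
void radix_sort(int a[], int n)
{
    if (n < 2)
        return;

    int max = a[0];
    for (int i = 0; i < n; i++) {
        assert(a[i] >= 0);              /* sketch handles non-negative values only */
        if (a[i] > max)
            max = a[i];
    }

    int *tmp = malloc((size_t)n * sizeof *tmp);
    assert(tmp != NULL);

    /* one pass per decimal digit of the largest value */
    for (long long exp = 1; max / exp > 0; exp *= 10) {
        int count[10] = {0};

        for (int i = 0; i < n; i++)
            count[(a[i] / exp) % 10]++;
        for (int d = 1; d < 10; d++)    /* prefix sums give each digit's end position */
            count[d] += count[d - 1];
        for (int i = n - 1; i >= 0; i--)    /* back to front keeps the pass stable */
            tmp[--count[(a[i] / exp) % 10]] = a[i];

        memcpy(a, tmp, (size_t)n * sizeof *a);
    }
    free(tmp);
}
```

The number of passes grows with the number of digits in the largest value,
which is the m in the O(m · N) bound.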

2.6.4. Summary of Similar Algorithms. 

The similarity between the address calculation sort, bin sort, and radix sort is
that a non-comparative method for sorting is used. However, in the worst case all
three algorithms degenerate, at the very least, to the performance of a comparative
sort algorithm, or to a polynomial time algorithm. This worst case occurs at the
extremes of data density, either too sparse or too dense, which then overloads the
algorithm. The hash sort, while not incapable of degenerating, in its worst case only
takes on a larger linear time constant. So the hash sort is not as sensitive to
extreme cases, and is more robust.

2.7. Features of Hash Sort. The hash sort has the following strengths: 

• Linear time complexity, even in the worst case; for the in-situ proportional to 
the data structure size, and for the direct proportional to the data list size. 

• The hash sort puts data elements in the correct position, and does not move them
afterward - data quiescence

• Data independence - data elements in the data set are mapped by their unique 
value, and do not depend on the predecessor or successor in the data set 

• High speed lookup is possible once the data is sorted - faster than binary 
search; or alternatively, to the approximate location within the data structure. 
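
The high speed lookup mentioned in the list above can be sketched as a
single application of the same mapping. This is a minimal illustration,
assuming the direct hash sort layout (values 0 .. DIM*DIM-1, a count of zero
marking an empty cell); the name lookup_count is an illustrative assumption.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 10                  /* dimension size of the matrix */

typedef struct {
    int count;                  /* 0 marks an empty cell */
    int value;
} element;

/* Return how many times value occurs in the matrix. The location is
   computed directly from the value, so no searching is needed. */
int lookup_count(element m[DIM][DIM], int value)
{
    assert(m != NULL);

    if (value < 0 || value >= DIM * DIM)
        return 0;               /* outside the range, so it cannot be present */

    const element *e = &m[value / DIM][value % DIM];
    return (e->count > 0 && e->value == value) ? e->count : 0;
}
```

A binary search over N sorted elements needs on the order of log N
comparisons, whereas this lookup performs one div, one mod and one comparison
regardless of N.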

The hash sort has the following weaknesses: 

• Sparsity of data values in range - wasteful of space 

• Multi-dimensional data structure is required - square planar matrices are 
inconsistent with underlying linear memory in one-dimension 

• Works only with numeric values, requires conversion for non-numeric value 

• The data range of values must be known for the algorithm to work effectively 

3. Testing.

3.1. Testing Methodology.

The testing methodology of the sort algorithm involves two perspectives: an
empirical or quantitative viewpoint, and a mathematical or qualitative view. The
mathematical or qualitative point of view looks at the hash sort algorithm alone. The
algorithm is tested for its characteristic behavior; this testing cannot be exhaustive, as
there are an infinite number of sizes of test cases to test the algorithm with. The
mathematical methodology, then, tests for data size, and data arrangement or
structuring. These form tests for which the size of the test data list is increasing, and
tests in which there are partially sorted sublists within the overall test data list. For
simplicity, the testing emphasis was placed on the in-situ version of the hash sort.

The testing approach for determining the behavior of the hash sort algorithm focuses
on the algorithmic time complexity. Testing is non-exhaustive, as all possible test
cases cannot be generated or tested. The algorithmic behavior is tested on
different test cases to avoid the possibility of an anomalous "best" case or extreme
"worst" case scenario. The hash sort algorithm is tested on different sizes and
different permutations of data lists to evaluate the algorithmic performance. Complete
data lists of unsorted or fully sorted lists are used, along with partially sorted data
lists. The hash sort is also compared to other algorithms that sort, to give relative
comparisons and contrasts for better assessment.

The two other sorting algorithms used to compare and contrast the hash sort
are the bubble sort and the quick sort. Each algorithm represents an extreme in
algorithmic performance. The bubble sort is an O(N^2) algorithm, but has excellent,
linear O(N) performance on partially sorted data lists. The quick sort is known as the
best and fastest sorting algorithm, with O(N log N) performance. However, the quick
sort does falter on already sorted data lists, degenerating into an O(N^2) time
complexity algorithm. Thus, the bubble sort is best when the quick sort is at its worst,
and vice-versa. Again, the extremes of algorithmic performance in pre-existing
algorithms have been selected for comparison and contrast with the hash sort algorithm.

3.2. Test Cases. 

Testing on data size looks at increasing data sizes and the rate of growth in the
time complexity of the algorithm. Testing on data size is on "complete" lists of data
elements, lists which are fully sorted or are unsorted. These types of "complete" data
test cases form the extremes of the other variations in the data test case. The sorted
lists are either fully sorted in ascending or descending fashion, and the unsorted list
is the median between the two extremes. Size testing looks at the effects of increasing
sizes on these three types of "complete" sorted data test cases.
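
A generator for the three "complete" test cases might look like the sketch
below. It is an assumption about how such cases could be produced, not the
original test program: the values 1 .. n are used so that each value is
distinct, the unsorted case is a Fisher-Yates shuffle driven by the standard
rand() function, and the names test_case and make_test_case are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { ASCENDING, DESCENDING, UNSORTED } test_case;

/* Fill list[0..n-1] with the values 1 .. n in the requested arrangement.
   For UNSORTED the list is shuffled; seed is passed to srand() so that a
   test run can be repeated. */
void make_test_case(int list[], int n, test_case kind, unsigned seed)
{
    assert(n >= 0);

    for (int i = 0; i < n; i++)
        list[i] = (kind == DESCENDING) ? n - i : i + 1;

    if (kind == UNSORTED) {
        srand(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);   /* slight modulo bias, acceptable for a sketch */
            int t = list[i];
            list[i] = list[j];
            list[j] = t;
        }
    }
}
```

Increasing n from run to run then gives the size testing described above for
each of the three arrangements.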

3.3. Test Program. 

The test program has to handle two important issues: the program code and the
test platform. On the test platform, that is, the computer system the code is run on,
the test program must not be biased by any hardware or platform architecture
enhancements. So a general-purpose computer without any specific enhancements or
features is used. For further platform independence, different computers of different
sizes and performance should be used, to be sure of the independence of the results
obtained.

The code must be written in a "neutral" computer language to avoid any esoteric
features of a language, and be readable. Similar to platform independence, the tests
must be independent of the language implementing the test programs. Any programming
language feature which gives a biased advantage, such as generated code that is
better optimized, is to be avoided.

The code must be readable, so a programming language which is expressive, along
with good programming style, must be used. Unreadable code will hamper duplication
of results by others and cast doubt on the efficacy and credibility of the tests. The
code must be portable, so that independent platform testing can be conducted, and
require minimum effort to re-write and re-work the code for continued testing. The
test case generator and the test sort programs are integrated together in one program.
The test case generator generates the particular test case in internal storage within
the program, and the hash sort, bubble sort, and quick sort all process the data. The
time to process the data is noted and written to an output data file. The time in
microseconds, and the size of the data set is recorded by the program.
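
The timing step can be sketched as follows. The standard C clock() function
is used here as a portable stand-in; the original test program used UNIX
system timing calls that are not shown in this section, so the timing call,
the output format and the names time_sort_us and record_result are all
assumptions of this sketch. A bubble sort is included only so that there is
something to time.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

typedef void (*sort_fn)(int list[], int n);

/* Run sort on list and return the elapsed processor time in microseconds. */
double time_sort_us(sort_fn sort, int list[], int n)
{
    assert(sort != NULL && n >= 0);

    clock_t start = clock();
    sort(list, n);
    clock_t end = clock();

    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}

/* Write one "size time" record to the output data file.
   Returns the fprintf result (negative on error). */
int record_result(FILE *out, int n, double microseconds)
{
    return fprintf(out, "%d %.0f\n", n, microseconds);
}

/* Bubble sort, used here only as an algorithm to be timed. */
void bubble_sort(int list[], int n)
{
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++)
            if (list[j] > list[j + 1]) {
                int t = list[j];
                list[j] = list[j + 1];
                list[j + 1] = t;
            }
}
```

The resolution of clock() varies between systems, so very small data sets may
need to be sorted repeatedly to obtain a measurable time.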

3.3.1. Test Platform Specifications. The table summarizes the different
computer systems which were used to test the hash sort.

Processor:          Sparc             Sparc            Cyrix i686-P200
Hardware:           sun4u             sun4m            vt 5099A
Operating System:   SunOS 5.5.1       SunOS 5.5.1      Linux 2.0.33
Processors:         12                4                1
Total Memory:       3072 Megabytes    256 Megabytes    64 Megabytes
GNU gcc Version:    2.7.2.3           2.7.2.3          2.7.33



The original Pascal implementations were converted to C using the p2c
converter. The generated C code was hand optimized to eliminate inefficiencies and
to follow strict ANSI C compliance. The code was modified for the UNIX system
environment, in particular the system timing calls. Some of the routines, such as
write_matrix and write_list, were originally used to check that the output was a sorted
list or array, verifying the algorithms were working. Once this had been checked,
the call to the procedure was commented out, as the actual test program dumped
its output to a data file. These routines are still present in the source code used
for the final timing runs.

4. Analysis of Results. 

4.1. Test Expectations. 

The expectations of the tests are for the quick sort and bubble sort to perform as
outlined. Both algorithms have been extensively studied over time, so any change in
these algorithms' performance and behavior would be a shocking surprise. The hash
sort is expected to remain linear, even as the bubble sort and quick sort fall apart
in their worst case scenarios. The hash sort will falter, but will not degenerate as
badly as the bubble sort and quick sort do in the extreme cases. Nor is the hash sort
expected to have any special cases of superior performance.

The hash sort will have linear time complexity performance which will have a
Big-Oh of O(N) in relation to the test case size. This consistency and stability of
the algorithm through different cases where the comparison algorithms degenerate or
accelerate performance is expected to be verified in the tests. The worst case of the
algorithm performance will be a constant multiple of linear time c · N, but it will
remain linear time-complexity.

4.2. Test Case Results. 

The table of the run-time performance of each algorithm in the appendix shows
an increasing time with increasing size of the data set. The bubble sort "explodes" as
expected, and is soon surpassed by the hash sort and quick sort. Both the hash sort
time and the quick sort time increase, but what is of interest is the ratio of the two.
The hash sort does not immediately surpass the quick sort; in fact, it is not until the
data set N > 146 that this occurs. A steady trend of the hash sort gaining occurs up
to data set N < 120. When the data set is in the range (120 < N < 146), the hash
sort to quick sort performance ratio occasionally is less than one. The performance
fluctuates until data set N > 145, when the ratio of the hash sort to the quick sort
is < 1.0, meaning the hash sort is performing faster than the quick sort. A decreasing
ratio was expected, but the hash sort did not immediately make gains on the quick
sort until the data set was much larger.

5. Conclusions. 

5.1. Testing Conclusions. 

The hash sort performed as expected, as did the bubble sort and the quick sort 
algorithms. What was surprising initially is that the hash sort does not have a per- 
formance lead over the quick sort. The bubble sort was soon surpassed by the hash 
sort as expected, but the hash sort did not exceed the performance of the quick sort 
as it should have theoretically. 

The reason for this seemingly strange performance of the algorithm is the
theoretical assumption that the hash sort and the quick sort operate using the same types
of operations with the same underlying machine code clock cycles. This of course is
not correct: the quick sort uses a compare based machine instruction, whereas the
hash sort uses an integer divide machine instruction.

A comparison of the clock cycle times for the compare and integer divide
instructions on an Intel 486 processor gives some indication of this (Brey pp. 723 - 729).



Opcode    Addressing Mode      Clock Cycles
------    -----------------    ------------
CMP       register-register    1
CMP       memory-register      3
DIV       register             40
DIV       memory               40
IDIV      register             43
IDIV      memory               44



The ratio of a compare instruction to an integer divide is 1:42 for register to
register based machine instructions, and 1:21 for memory based machine instructions.
In summary, the quick sort and the hash sort use different machine instructions to
implement the algorithm, although theoretically they are equivalent at an abstract
level. At the implementation level, this is not a valid assumption to
make. A simple analysis of the algorithms in this context shows this.

\begin{align*}
F(\text{Quick sort}) &> F(\text{Hash sort}) \\
c_2 \cdot N \cdot \log_2 N &> c_1 \cdot 2 \cdot N && \text{dividing through by } N \\
c_2 \cdot \log_2 N &> c_1 \cdot 2 && \text{dividing through by } c_2 \\
\log_2 N &> \frac{c_1}{c_2} \cdot 2 && \text{substituting } N = 146^2 \\
\log_2(146^2) &> \frac{c_1}{c_2} \cdot 2 && \text{move square down} \\
2 \cdot \log_2(146) &> \frac{c_1}{c_2} \cdot 2 && \text{divide through by 2} \\
\log_2(146) &> \frac{c_1}{c_2} && \text{evaluate the log} \\
7.19 &> \frac{c_1}{c_2} && \text{since } \log_2(146^2) \approx 14.38
\end{align*}



This rough analysis is an approximation of the ratio of arithmetic instruction time
to comparison instruction time. It does not directly correspond to the ratios of clock
cycles for the Intel instructions. There are other factors, such as compiler optimization
and memory manipulation overhead, to be considered. But this rough analysis does
give some insight into the "critical mass" that the hash sort must reach before it can
exceed a comparison based algorithm - in this case the quick sort.

5.2. Evaluation Criteria. The three algorithms used are to be evaluated on 
the basis of the following criteria: 

• coding complexity - how difficult or easy is it to code the algorithm? 

• time complexity - how fast is the algorithm in performance? 

• space complexity - how efficient is the algorithm in using memory? 

To rank the algorithm on each of the criteria, the simple expedient of following
a grading system is used. Rankings are: excellent, good, fair, poor, bad. A comment
as to the ranking and why it was judged to rate such a score is given.

5.3. Evaluation of the Algorithm. The hash sort has the following rankings: 

• coding complexity: excellent; the hash sort is a short algorithm to encode, 
uses only a simple iterative loop and array manipulation. 

• time complexity: good; the hash sort is linear in theory, but in practice must 
reach a certain critical mass, so its overall time complexity is more degenerate 
than the theory behind the algorithm would indicate. 

• space complexity: fair; the hash sort has some overhead in the algorithm,
most of which is counters that are incremented as the algorithm progresses.
But the algorithm is fitting an n-dimensional array onto a one-dimensional
memory, so this is somewhat awkward and introduces memory manipulation
complexity overhead.

5.4. Comparison to Test Algorithms. 



Criteria             Hash Sort    Quick Sort    Bubble Sort
-----------------    ---------    ----------    -----------
Coding Complexity    Excellent    Fair          Excellent
Time Complexity      Excellent    Good          Poor
Space Complexity     Fair         Good          Good

5.5. Summary of Sort Algorithms.

The hash sort compared to the bubble sort is similar in coding complexity, very 
short code which is simple and easy to follow. However, it is much faster than the 
bubble sort, although it is not as efficient in using memory. The bubble sort is much 
more efficient in using memory, but it does have much data movement as it sorts the 
data. 

The quick sort is much more complex to code, even the recursive version. A
non-recursive version of the quick sort becomes quite complex and ugly in coding.
The quick sort does outperform the hash sort initially, but it is still a linearithmic
algorithm, and does have a degenerate worst case, although it is rare in practice. The
quick sort is good at managing memory, but it is a recursive algorithm, which has
somewhat more overhead than an iterative algorithm. Because of its partitioning and
sub-dividing approach, the amount of data movement is less than with bubble sort.

5.6. Further Work. 

Although the initial development and investigation into the properties, performance,
and promise of the hash sort seem thorough, there are still other avenues of research.
Further research on the hash sort can be done in terms of investigating a recursive
version of the algorithm, seeing if it can be parallelized, remedying the problem of
data sparsity, and finding other mapping functions.

5.6.1. Recursive implementation:. 

The hash sort has been implemented as an iterative algorithm, but some sorts,
such as the quick sort, are inherently recursive. A recursive version of the hash sort
may be possible, but it may not translate well into a recursive definition or
implementation. Implementing the hash sort as smaller mappings
of the same data set is an interesting possibility.

5.6.2. Parallelization of algorithm:. 

The hash sort algorithm has been implemented as a serial algorithm on a unipro- 
cessor system, although two of the systems it was tested on had multiple processors. 
It is possible that the hash sort could be parallelized to increase its performance be- 
yond linear time. The tradeoffs of such a parallel algorithm, and the issues involved 
are another possibility for research. 

5.6.3. Sparsity of data:. 

Sparse data within the range of the hash sort is a problem with the
algorithm. There may be ways of handling sparse data so that space can be utilized
more efficiently than the algorithm implemented here does. If the problem with data
sparsity can be resolved, then the hash sort could become an order of magnitude more
usable in applications.

5.6.4. Other mapping functions:. 

The heart of the hash sort is the super-hash functions, which are perfect hashing
functions which preserve ordering. This mapping function used the classical hash
function along with a mash function to hash magnitude. Other mapping functions which
have the same properties would be very interesting functions in themselves, and could
be the basis for another variant of the hash sort. A less complex mapping function
that did not use the time consuming div and mod operations would be an obvious
improvement in the hash sort.

5.6.5. Machine code optimization:.

The difference in the machine instructions used to implement the underlying
algorithms makes the hash sort need a "critical mass" to reach a sufficient size before it
meets its theoretical performance expectations. Optimizations of the machine
generated code to reduce the size of the critical mass required are an interesting
possibility. Such optimizations could be compiler-based, or based strictly upon the
properties of the hash sort algorithm.

5.6.6. Alternative data structure:. 

The current version of the hash sort uses multiple N square matrices for the data
structure being mapped to by the super-hash function. One possibility to be
explored is using non-square matrices, and possibly other data structures that are
not square or rectangular. A change in the data structure mapped to requires a
change in the mapping function, as the mapping function is determined by the data
structure being mapped to by the super-hash function.

6. Application. 

6.1. Criteria for using Hash Sort. 

The hash sort, although a general-purpose algorithm, is not an algorithm that 
meets all needs for applications. Because of its properties and features, there are 
several criteria which guide using it. These criteria are: 

• data set is within a known range 

• data set is numeric or inherently manipulatable as numbers 

• the data needs to be ordered quickly then accessed frequently

• large data set with heavy density within the data range
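
The criteria above can be turned into a rough screening function. The sketch below is illustrative only: the function names, the treatment of density as the fraction of the range occupied by distinct values, and the 0.5 threshold are all assumptions, not values given in the text.

```python
def density(values: list[int], low: int, high: int) -> float:
    """Fraction of the inclusive range [low, high] occupied by distinct values."""
    if high < low:
        raise ValueError("high must be >= low")
    in_range = {v for v in values if low <= v <= high}
    return len(in_range) / (high - low + 1)


def suits_hash_sort(values: list[int], low: int, high: int,
                    min_density: float = 0.5) -> bool:
    """Rough screen against the listed criteria.

    min_density is a hypothetical threshold chosen for illustration.
    """
    # Criterion: data is numeric (here: plain integers).
    if not all(isinstance(v, int) for v in values):
        return False
    # Criterion: data set is within the known range.
    if any(v < low or v > high for v in values):
        return False
    # Criterion: heavy density within the data range.
    return density(values, low, high) >= min_density
```

For example, `suits_hash_sort([1, 2, 3, 4], 1, 8)` passes the screen, while `suits_hash_sort([1, 100], 1, 100)` fails it because only two of the hundred slots would be used.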

6.2. Applications for Hash Sort. 

The applications suited for the hash sort can be surmised from the criteria
mentioned previously, but some applications include:

• database systems - for high speed lookup and retrieval 

• data mining - organizing and searching enormous quantities of data 

• operating system - file organization and access 

6.3. Example Application with Hash Sort. 

One interesting application of the hash sort is in data communications. This
example involves data communication on an X.25 packet switched network. In this
method of data communications, data is disassembled into packets, which are then
sent through the network. At the reception point, the packets are reassembled into
the original data. One problem with the X.25 packet switched network is that the
packets may be out of order when received from the sender, or "...may arrive at
the destination node out of sequence." (Stallings 1985 p. 249) That is, data can
arrive in sequence as sent, but it may not. This is because "...datagrams are
considered to be self-contained, the network makes no attempt to check or preserve
entry sequence." (Rosner 1982 p. 117)

There is a need to sort the packets once they are received. This is not so
straightforward, because there is the chance that the packets may already be in
order, so sorting may be unnecessary. Using a quick sort on a sorted sequence
would be the worst-case degenerate example: the quick sort would become an
$O(N^2)$ algorithm, as bad as bubble sort.
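
The degenerate case can be observed by counting comparisons. The sketch below assumes a quick sort that always takes the first element as the pivot; the text does not say which pivot rule is meant, and other pivot rules degenerate on different inputs. The function name is hypothetical.

```python
def quick_sort_comparisons(data: list[int]) -> int:
    """Sort a copy of data with a first-element-pivot quick sort and
    return the number of element comparisons made."""
    items = list(data)
    comparisons = 0
    # An explicit stack of (low, high) ranges avoids deep recursion
    # on already sorted input.
    stack = [(0, len(items) - 1)]
    while stack:
        low, high = stack.pop()
        if low >= high:
            continue
        pivot = items[low]
        i = low  # last index of the "less than pivot" region
        for j in range(low + 1, high + 1):
            comparisons += 1
            if items[j] < pivot:
                i += 1
                items[i], items[j] = items[j], items[i]
        # Move the pivot between the two regions.
        items[low], items[i] = items[i], items[low]
        stack.append((low, i - 1))
        stack.append((i + 1, high))
    assert items == sorted(data)
    return comparisons


print(quick_sort_comparisons(list(range(100))))  # 4950 = 100 * 99 / 2
```

On sorted input every partition peels off a single element, so the count is N(N-1)/2, which is the quadratic behaviour described above.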

Even worse, all the packets must be received before they can be sorted, if sorting 
is required. To use a quick sort for this application, a separate algorithm to check 
if the packets are already sorted is required, then if not, the quick sort algorithm is 
used. 

Overall this adds additional time overhead in terms of:

$$\text{Total time} = \text{Time}(\text{receive}) + \text{Time}(\text{check sorted}) + \text{Time}(\text{quick sort})$$

$$\text{Total time} = \text{Time}(r) + \text{Time}(N) + \text{Time}(N \cdot \log N)$$

$N$ is the number of packets received. Checking that the packets are in order
requires a linear scan, $O(N)$, to examine all packets, although the scan can stop
early when they are out of order.
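
The check-then-sort scheme described above might look like the following sketch. The names are hypothetical, packets are represented only by their sequence numbers, and Python's built-in `sorted` stands in for the quick sort, so this shows the control flow rather than the exact cost.

```python
def is_sorted(sequence_numbers: list[int]) -> bool:
    """Linear scan that stops at the first out-of-order pair."""
    for previous, current in zip(sequence_numbers, sequence_numbers[1:]):
        if current < previous:
            return False
    return True


def order_packets(sequence_numbers: list[int]) -> list[int]:
    """Check first, and sort only if the packets arrived out of order."""
    if is_sorted(sequence_numbers):
        return list(sequence_numbers)
    # Stand-in for the quick sort discussed in the text.
    return sorted(sequence_numbers)
```

Both steps can only start after the last packet has arrived, which is the overhead the two formulas above account for.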

The hash sort is a better algorithm to use, although the packets would have to be
stored in a matrix structure rather than a linear list. A check for the packets
being received in order is unnecessary, because the hash sort will not degenerate
on sorted data - it will be an extreme form of hysteresis, but the hash sort
remains linear in such cases.

More interesting is that the hash sort does not have to wait for the last packet 
to be received - it can begin to sort the packets as they are received. The hash sort 
maps the packets to where each uniquely belongs. Since no two packets share the 
same location, they are unique. So the hash sort can uniquely determine from the 
packet itself where it belongs in the matrix. As packets are received, they are placed 
into their appropriate location. By the time of the last packet arrival, all the data is 
in place. 
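
A minimal sketch of this placement-on-arrival idea follows. It assumes each packet carries a unique sequence number in the range 0 to n*n - 1, that a single n by n matrix is used, and that payloads are never `None`; the function names are hypothetical.

```python
def make_matrix(n: int) -> list[list]:
    """An n x n matrix of empty slots."""
    return [[None] * n for _ in range(n)]


def place_packet(matrix: list[list], n: int, seq: int, payload) -> None:
    """Place one packet using the mash (div) and hash (mod) of its number."""
    row, col = seq // n, seq % n
    if not 0 <= row < n:
        raise ValueError("sequence number outside the matrix range")
    matrix[row][col] = payload


def read_in_order(matrix: list[list]) -> list:
    """Row-major traversal yields the payloads in sequence order."""
    return [cell for row in matrix for cell in row if cell is not None]


# Packets arrive out of order and are placed one at a time.
n = 3
matrix = make_matrix(n)
arrivals = [(5, "f"), (0, "a"), (3, "d"), (1, "b"), (4, "e"), (2, "c")]
for seq, payload in arrivals:
    place_packet(matrix, n, seq, payload)
print(read_in_order(matrix))  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Each placement depends only on the packet's own sequence number, so no packet has to wait for any other.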

The hash sort may be slower than the data reception time, in which case the 
algorithm will continue to sort the packets as they are stored. The total time in this 
case would be: 

$$\text{Total time} = \text{Time}(\text{receive}) + \text{Time}(\text{hash sort})$$

$$\text{Total time} = \text{Time}(r) + \text{Time}(N)$$

This time is the same as in the quick sort implementation when the data happens
to be received already in order, so that only the algorithm to check that the data
is in order is run. This example shows the power of data independence in the hash
sort algorithm: a data element does not depend on its predecessor or successor for
its location; the location is determined uniquely by the data element itself.

This property of data independence, together with the speed of the hash sort and
its robustness in the degenerate case, makes the hash sort well suited to such
applications. One drawback, however, is that the number of packets sent must be
large enough to achieve the "critical mass" needed by the hash sort.

7. Mathematical Proofs. 

7.1. Proof of Injective Mapping Function. 

Proof for a general super-hash function with an n-tuple of 2

Theorem: 

Given a set $S$ of unique integers in $\mathbb{Z}$ such that
$S = \{v_1 \neq v_2 \neq \cdots \neq v_{n-1} \neq v_n\}$, and given a function
$F(x)$ such that:

$$F(x) = \begin{cases} p = x \ \mathrm{div}\ n; & \text{Mash function} \\ q = x \bmod n; & \text{Hash function} \end{cases}$$

where $n > 0$, then with the resulting image set $M$ of ordinal pairs
$M = \{(p_1, q_1), (p_2, q_2), \ldots, (p_n, q_n)\}$ formed from the pre-image
set $S$, $F: S \to M$ is an injective mapping.
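
Before the proof, the claim can be spot-checked numerically. The sketch below is a brute-force check over small ranges, not a substitute for the proof; the function names are hypothetical, and the order check relies on Python comparing the (mash, hash) pairs lexicographically.

```python
def super_hash(x: int, n: int) -> tuple[int, int]:
    """The pair (p, q) = (x div n, x mod n) from the theorem."""
    return x // n, x % n


def is_injective_on(values: list[int], n: int) -> bool:
    """True if distinct values always produce distinct (p, q) pairs."""
    images = [super_hash(x, n) for x in values]
    return len(set(images)) == len(set(values))


def preserves_order(values: list[int], n: int) -> bool:
    """True if increasing values produce increasing (p, q) pairs."""
    ordered = sorted(set(values))
    images = [super_hash(x, n) for x in ordered]
    return images == sorted(images)


# Check every mapping value from 1 to 12 over the integers 0..199.
sample = list(range(200))
assert all(is_injective_on(sample, n) for n in range(1, 13))
assert all(preserves_order(sample, n) for n in range(1, 13))
```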

Proof: 

Proof by Contradiction: Proceed with the assumption that, given $S$ as described
before, $\exists (x, y) \in S$ such that $F(x) = F(y)$ when $x \neq y$, or that
the function $F: S \to M$ is a non-injective mapping.

The definition of injective (one-to-one) is "A function $f$ is one-to-one if and
only if $f(x) = f(y)$ implies that $x = y$ for all $x$ and $y$ in the domain of
the function."

$$[\forall (x, y) \in f: X \to Y, \text{ where } (f(x) = f(y)) \supset (x = y) \text{ is injective}]$$

By taking the contrapositive of this definition, it is restated as "A function
$f$ is one-to-one if and only if $f(x) \neq f(y)$ whenever $x \neq y$."
(Rosen 1991, p.57)

$$[\forall (x, y) \in f: X \to Y, \text{ where } \neg(x = y) \supset \neg(f(x) = f(y)) \text{ is injective}]$$

rewrite as

$$[\forall (x, y) \in f: X \to Y, \text{ where } (x \neq y) \supset (f(x) \neq f(y)) \text{ is injective}]$$

Using the contrapositive of the definition of an injective function, it is readily
clear that the mapping $F: S \to M$ is not injective if there are at least two
distinct integers $i_1$ and $i_2$ such that by the mapping function $F$,
$(p_1, q_1) = (p_2, q_2)$ in $M$. This is assumed to be true, as a non-injective
mapping function.

By the definition of $S$, all the integers are unique in $\mathbb{Z}$, so any two
integers $x, x' \in S$ have the property that $x \neq x'$.

Take the case for $x$:

$$p_x = x \ \mathrm{div}\ n = d_x \quad \text{where } d_x \geq 0$$

$$q_x = x \bmod n = m_x \quad \text{where } 0 \leq m_x < n$$

Take $x' = x + c$ where $0 < c < n$:

$$p_{x'} = x' \ \mathrm{div}\ n = (x + c) \ \mathrm{div}\ n = (x \ \mathrm{div}\ n) + ((m_x + c) \ \mathrm{div}\ n) = d_x + k \quad \text{where } d_x \geq 0,\ k \in \{0, 1\}$$

$$q_{x'} = x' \bmod n = (x + c) \bmod n = (m_x + c) \bmod n$$

If $k = 0$, then $q_{x'} = m_x + c \neq m_x$; if $k = 1$, then
$p_{x'} = d_x + 1 \neq d_x$. So then for $x \neq x'$, $F(x) \neq F(x')$. But this
is for $x' = x + c$ where $0 < c < n$.
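Take the case when $c \geq n$. Using the same definition for $x$, take
$x' = x + c$ where $c = n$:

$$p_{x'} = x' \ \mathrm{div}\ n = (x + n) \ \mathrm{div}\ n = (x \ \mathrm{div}\ n) + (n \ \mathrm{div}\ n) = d_x + 1 \quad \text{where } d_x \geq 0$$

$$q_{x'} = x' \bmod n = (x + n) \bmod n = (x \bmod n) + (n \bmod n) = m_x + 0 = m_x \quad \text{where } 0 \leq m_x < n$$

For any larger $c > n$ the mash value satisfies $p_{x'} \geq d_x + 1$, so the
pairs still differ in their first component.

So then for $x \neq x'$, $F(x) \neq F(x')$.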

For each case of the form $x' = x + c$, where $c < n$ and $c \geq n$, if
$x \neq x'$ then $F(x) \neq F(x')$, which contradicts the original assumption that
$\exists (x, y) \in S$ so that $F(x) = F(y)$ when $x \neq y$.

So the occurrence of $d_x = d'_x$ and $m_x = m'_x$ (where $d'_x$ and $m'_x$ are
the mash and hash values of $x'$) such that $x \neq x'$ is never true and will
never occur, which is the only possibility for the mapping $F: S \to M$ to be
non-injective.

Thus it follows that no two ordinal pairs in $M$ formed from any two distinct
integers in $S$ by $F$ would ever be equal. Since equal pairs are the only
counter-example to $F(x)$ being an injective mapping, the only conclusion is that
$F: S \to M$ must be an injective mapping under $F(x)$.

Q.E.D. 
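
As a cross-check on the result, a shorter direct argument is sketched here. It assumes only the division identity for integers $x \geq 0$ and $n > 0$, with $p_x$ and $q_x$ as defined in the theorem.

$$x = n \cdot (x \ \mathrm{div}\ n) + (x \bmod n) = n \cdot p_x + q_x$$

Suppose $F(x) = F(x')$, so that $p_x = p_{x'}$ and $q_x = q_{x'}$. Then

$$x = n \cdot p_x + q_x = n \cdot p_{x'} + q_{x'} = x'$$

so $F(x) = F(x')$ implies $x = x'$, which is the definition of an injective function quoted above.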

7.2. Proof of Injective Multiple Mappings.

Proof using general super-hash function

Theorem:

For an injective mapping $F: S \to M$, using a set of mapping values $n \in N$,
the set of unique mapping values used in the super-hash function, a function
composed of multiple sub-mappings of $F$ with $n \in N$ is a complete mapping
function overall. Therefore such a constructed function would be an injective
mapping.

Proof:

Proof by Demonstration: Define a function $G$, which non-redundantly selects
$n_x \in N$, where $x \leq \|N\|$, the cardinality of the set of elements in $N$.
Therefore, each selected element is guaranteed by the definition to be unique from
the predecessor or successor elements, so that no element once selected as a
mapping value will be selected again. So the function $G$ selects an element once
and only once from $N$.

The selection function $G$ is an injective function by its definition. For each
$n_k \in N$ there is an integer $x$, $1 \leq x \leq \|N\|$, which is uniquely
associated with the particular element $n$ selected. The definition of an
injective function is "A function $f$ is one-to-one if and only if $f(x) = f(y)$
implies that $x = y$ for all $x$ and $y$ in the domain of the function."

For the particular selection or iteration of $G$ which is $k$, there is a unique
element $n_k$, selected only once by $G$ from $N$. Thus, iterations $k - 1$, $k$,
and $k + 1$ will not select the same element, so $G(k) = n_k$, so that
$k \mapsto n_k$ for the set $N$ under the selection function $G$. Hence the
definition and construction of $G$ make it an injective function.

The mapping function $F: S \to M$ has been proven previously to be an injective
mapping function for $n$, where $n$ is the mapping value used in the mash and hash
sub-functions. Composing a new function using the mapping function $F$ with the
selection function $G$, the resulting composed function $H$ is an injective
mapping.

$F: S \to M$ uses a mapping value $n$, selected from $N$. So $F: S \to M$ could be
rewritten as:

$$F(n): S \to M$$

going one step further,

$$F(G): S \to M$$

and moreover since $G$ uses $N$ as its domain, then

$$F(G: N): S \to M$$

The notation is rapidly becoming cumbersome, so in general using a mapping
function $F$ along with a selection function $G$, we have a function
$H = F \circ G$, meaning the function $H$ is a composition of the functions $F$
and $G$. With this in mind, the proof that $H$ is an injective function is very
straightforward, and involves the property that $F$ and $G$ are injective
functions. Therefore, the composite function $H$ is also an injective function.

The proof that the composite of two injective functions is also injective is:

If $f: X \to Y$ and $g: Y \to Z$ are injections (injective functions), then so is
the composite $g \circ f: X \to Z$. Suppose $gf(x) = gf(x')$. Since
$g(f(x)) = g(f(x'))$, and $g$ is an injection, we must have $f(x) = f(x')$, and
since $f$ is an injection, $x = x'$. Hence $g \circ f$ is an injection.
(Biggs 1985, p. 31)

The function $F$ uses a particular mapping value $n$, where $G$ selects
$n \in N$. So it follows that $G(k): N \to n$, for a particular iteration or
selection $k$. $F(n): S \to M$ has already been proved as an injective mapping by
and of itself. For multiple iterations of $k$, $F(G(k): N): S \to M$ is an
injective mapping.

The power of $F$ to map is expanded from one mapping instance to multiple $k$
iterations using a mapping value set $N$ and a function $G$ to uniquely select
particular elements as mapping values. Hence the injective mapping property of $F$
composed with $G$ has been generalized into multiple mappings.

Q.E.D 
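
The construction can be illustrated with a small brute-force check. The sketch below is one possible reading of the theorem, assuming that each selected mapping value is applied to the same input and the resulting pairs are collected into a tuple; the function names and that interpretation are assumptions, not taken from the text.

```python
def super_hash(x: int, n: int) -> tuple[int, int]:
    """The mapping F with mapping value n: (x div n, x mod n)."""
    return x // n, x % n


def make_selector(mapping_values: list[int]):
    """Build the selection function G over the set N of mapping values.

    Selection is non-redundant: repeated values are rejected up front.
    """
    if len(set(mapping_values)) != len(mapping_values):
        raise ValueError("mapping values must be unique")
    if any(n <= 0 for n in mapping_values):
        raise ValueError("mapping values must be positive")

    def select(k: int) -> int:
        # Iteration k (1-based) selects the k-th mapping value.
        return mapping_values[k - 1]

    return select


def multi_map(x: int, mapping_values: list[int]) -> tuple:
    """Apply F once for every mapping value selected by G."""
    select = make_selector(mapping_values)
    count = len(mapping_values)
    return tuple(super_hash(x, select(k)) for k in range(1, count + 1))


def is_injective(func, domain) -> bool:
    """Brute-force injectivity check over a finite domain."""
    points = list(domain)
    images = [func(x) for x in points]
    return len(set(images)) == len(set(points))


N = [3, 5, 7]   # hypothetical set of mapping values
S = range(100)  # hypothetical data set
select = make_selector(N)
assert is_injective(select, range(1, len(N) + 1))                      # G
assert all(is_injective(lambda x, n=n: super_hash(x, n), S) for n in N)  # F
assert is_injective(lambda x: multi_map(x, N), S)                      # combined
```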